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j^ ■ Abstract. Completely automatic and adaptive non-parametric inference is a pie 

in the sky. The frequentist approach, best exemplified by the kernel estimators, 
has excellent asymptotic characteristics but it is very sensitive to the choice of 
smoothness parameters. On the other hand the Bayesian approach, best exem- 
plified by the mixture of gaussians models, is optimal given the observed data 
but it is very sensitive to the choice of prior. In 1984 the author proposed to use 
the Cross- Validated gaussian kernel as the likelihood for the smoothness scale pa- 
^. I rameter h, and obtained a closed formula for the posterior mean of h based on 

r^ ■ Jeffreys's rule as the prior. The practical operational characteristics of this bayes' 

OhI rule for the smoothness parameter remained unknown for all these years due to the 

combinatorial complexity of the formula. It is shown in this paper that a version 
^\J I of the metropolis algorithm can be used to approximate the value of h producing 

^ ■ remarkably good completely automatic and adaptive kernel estimators. A close 

study of the form of the cross validated likelihood suggests a modification and a 
^i ■ new approach to Bayesian Non-parametrics in general. 

(N 
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.^ ■ 1. Introduction 

C/3 _ 

^^. A basic problem of statistical inference is to estimate the probability distribution 

that generated the observed data. In order to allow the data to speak by themselves, 
it is desirable to solve this problem with a minimum of a priori assumptions on 
the class of possible solutions. For this reason, the last thirty years have been 
^^ , burgeoning with interest in nonparametric estimation. The main positive results 

V^ ' are almost exclusively from non-Bayesian quarters, but due to recent advances in 

Monte Carlo methods and computer hardware, Bayesian nonparametric estimators 
are now becoming more competitive. 
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This paper aims to extend the idea in ^ to a new general Bayesian approach 
to nonparametric density estimation. As an illustration one of the techniques is 
applied to compute the estimator in [nl . In particular it is shown how to approxi- 
mate the bayes estimator (for Jeffreys' prior and quadratic loss) for the bandwidth 
parameter of a kernel using a version of the metropolis algorithm. The computer ex- 
periments with simulated data clearly show that the bayes estimator with Jeffreys' 
prior outperforms the standard estimator produced by plain Cross-validation. 

2. Density Estimation by Summing Over Paths 

Any non-Bayesian nonparametric density estimator can be turned into a Bayesian 
one by the following three steps: 

1. Transform the estimator into a likelihood for the smoothness parameters via 
the sum over paths technique introduced below. 

2. Put a reference prior on the smoothness parameters. 

3. Use the predictive distribution as the Bayesian estimator obtainable by Markov 
Chain Monte Carlo. 

The summing over paths method, springs from the simple idea of interchanging 
sums and products in the expression for the cross- validated kernel likelihood (see 
^). Without further reference to its humble origins, let us postulate the following 
sequence of functions, 

n 
^n=^ {xo,Xi,X2,-..,Xn\h) (X ^ J]^ i^,j (xj - X^^ ) (1) 

all paths j^O 

where h > 0, Kh{x) = j^K{x/h) with K a density symmetric about 0. We call a 
vector of indices, (iq, . . . , in) with the property that ij € {0, . . . , n} \ {j}, a path 
(more specifically a general unrestricted path, see below) . The sum above runs over 
all the n"+^ possible paths. The functions, $„ are defined up to a proportionality 
constant independent of the Xj's. 

Notice that by flipping the sum and the product we get 



1 



X! n ^'^ (^3 ~ ^^j) = /-o,«(a;o)/-i,n(a;i) • • • /_„,„(a;„) (2) 

all paths J^O 



where, 



1 " 
f-jA^i) = -'^^^h {Xj -x,^) . (3) 

Thus, f-^j^n{xj) is nothing but the kernel density estimator of f{xj) based on all 
the data except the jth observation Xj. Under mild regularity conditions the kernel 
estimator is known to converge (in every conceivable way) provided that h = hn 
is taken as a function of n such that, /i„ — > and nh^ ^ cxd as n ^ cxd. 
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The $n's can be used as a universal method for attaching a class of exchange- 
able one parameter models to any set of observations. The positive scalar param- 
eter h is the only free parameter, and different models are obtained by changing 
the kernel function K. 

These empirical parametric models are invariant under relabeling of the ob- 
servations (i.e. they are exchangeable) but they do not model the observations as 
independent variables. Rather, these models introduce a pattern of correlations for 
which there is a priori no justification. This suggest that there might be improve- 
ments in performance if the sum is restricted to special subsets of the set of all 
n""'""'^ paths. Three of these modifications are mentioned in the following section. 

Notice also that the ability of the $„ to adapt comes at the expense of regu- 
larity. These models are always non-regular. If the kernel has unbounded support 
then $„ integrates to infinity but the conditional distribution of xq given xi, . . . , a;„ 
and h is proper. When the kernel has compact support the $„ are proper but still 
non-regular since their support now depends on h. 

The above recipe would have been a capricious combination of symbols if it 
not were for the fact that, under mild regularity conditions, these models adapt to 
the form of the true likelihood as n increases. 

As a function of xq = x, the $„ have the following asymptotic property. 

Theorem 1 If xi,X2, ■■■ ,Xn are iid observations from an unknown pdf f which 
is continuous a.s. and h = h^ is taken as a function of n such that, ft,„ ^ and 
nhn ^ oo as n ^ oo, then, 

j^^ = /-o,„(a;) + o(l) = fix) + oil). (4) 

where the little a is taken in probability as n ^ oo 

Proof (sketch only) 
Just flip the sum and the product to get again, 

1 " 

~:^ Yl Yl^hixj -X^^) = f-0.nix)f-i.nixi)---f-n,nix„) (5) 

all paths j^O 

Under the simple regularity conditions of the theorem, the kernel estimator is 
known to converge in probability as n ^ oo. However, even though xq appears 
in all of the n + I factors, and their number goes to infinity, still all the factors 
are converging to the value of the true density at the given point. Therefore the 
theorem follows. D 

It is worth noticing that the above theorem is only one of a large number of 
results that are readily available from the thirty years of literature on density 
estimation. In fact under appropriate regularity conditions the convergence can be 
strengthen to be pointwise a.s., uniformly a.s., or globally a.s. in Li, or L2. 

3. Paths, Graphs and Loops 

Each of the ri"+^ paths (io, . . . ,in) can be represented by a graph with nodes 
xq, . . .,Xn and edges from Xj to Xk if and only if ij = k. Here are some graphs 



CARLOS C. RODRIGUEZ 



X2)—- (Xl 



xa) 1x3 



Figure 1. The graph of (2, 3, 1, 2) 

for paths with n = 3. For example, the path (2,3,1,2) is given a probabihty 
proportional to 

Kh{xo - X2)Kh{xi - xz)Kh{x2 - xi)Kh{xz - X2) (6) 

and represented by the graph in figure |1|. Let's call it a 1-3-loop. The path 
(1, 2, 3, 0) is the single ordered loop of size four (a 4-loop), (3, 0, 1, 2) is the same 
loop backwards (also a 4-loop), (2, 3, 0, 1) are two disconnected loops (a 2-2-loop) 
and (1, 0, 0, 0) is connected and contains a loop of size two with xq and xi in it (a 
1-1-2-loop). Draw the pictures! 

The classification of paths in terms of number and size of loops appears natu- 
rally when trying to understand how $„ distributes probability mass among the 
different paths. To be able to provide simple explicit formulas let us take K in the 
definition of <i>„ to be the standard gaussian density, i.e. from now on we take, 



The gaussian kernel has unbounded support and that makes the total integral of 
each path to diverge. Thus, the partition function 

<^ndxo . . . dxn (8) 

/n 
Y\^ Kh {xj - X.,. )dxo... dxn (9) 

all patna j = {) 

is the sum of infinities and it also diverges. Recall that this anomaly is the price 
we need to pay for using a model with a finite number of free parameters (only 
one in this case) and hoping to still adapt to the form of the true likelihood as 
n ^ 00. Even though the value of Z is in fact 00 we can still write (formally) a 
useful decomposition that will help explain how the <i>„'s adapt and how to modify 
the set of paths to improve the convergence. We first need the following simple 
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property of gaussians, 

/ Ka{x - y)Kb{y - z)dy = K^^^2+^{x - z). (10) 

This can be shown by straight forward integration after completing the square. 
Now notice that whatever the value of the integrals appearing in equation (g) that 
value only depends on the type of loop that is being integrated. For this reason we 
omit the integrand and simply denote the value of the integral with the integral 
sign and the type of loop. With this notation we have, 



Theorem 2 



/mi-m2-...-rn^-loop \Jmi-loopJ \Jm2-l00pJ \j nik-loop j 

More over, Ji^Iqqy) — 1 "■''T'd for m > 1, 

1 /i-i 



(11) 



(12) 



i-loop v27rv"^ 
where we write formally L — J dx. 

Proof 

Equation ( pi] ) follows from Fubini's theorem. To get ( p^ ) use Fubini's theorem and 
apply (|0|) each time to obtain, 

Kh{xo - xi)Kh{xi - X2)Kh{x2 ~ X'i) ... Kh{x,n^i ~ xo)dxodxi . . . dXm-l 

^V2h(^0 - X2)... Kh{Xm-l - Xo)dxodx2 ■ . . dXra^i 
^V^^^^^^hi^O - Xjn-l)Kh{Xm-l - Xo)dxodXm-l 

1 /l^l 

^ V2tt Jm 



-loop 



D 

Hence, by splitting the sum over all paths into, 

E - E + E +•••+ E 

all paths 2-2...-2-Ioops 1-3-2. . .-2- loops (n-|-l)-Ioops 

and applying the previous theorem we obtain, 



Z^N2-2...-2\^=^=L\ +...+iVn+l ^ , - L] (13) 

V\/2^\/2 / V \/2^ V"- + 1 / 

where for simplicity we have assumed that n is odd and we denote by Nmi-...-mk 
the total number of nii — ... — m^. — loops. Using simple combinatorial arguments 
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it is possible to write explicit formulas for the number of loops of each kind. 
The important conclusion from the decomposition (|lj) is that even though the 
<&„ appear to be adding equally over all paths, in reality they end up allocating 
almost all the probability mass on paths with maximally disconnected graphs. 
This is not surprising. This is the reason why there is consistency when assuming 
iid observations. There is a built in bias towards independence. The bias can be 
imposed explicitly on the $„ by restricting the paths to be considered in the sum. 
Here are three examples: 

loops: 

Only paths (iq, . . . , i„) that form a permutation of the integers {0, 1, . . . , n} 

are considered. 
2 - 2 - ... - 2-loops: 

Only maximally disconnected paths are considered. 
QM: 

Paths as above but use |$„| instead of $„ as the joint likelihoods. 

Preliminary simulation experiments seem to indicate that only maximally dis- 
connected paths are not enough and that all the loops are too many. The QM 
method has all the maximally disconnected paths but not all the loops (e.g. with 
n = 5 the 3-3-loops can not be reached by squaring the sum of 2-2-2-loops) so it 
looks like the most promising among the three. What is more interesting about the 
QM method is the possibility of using kernels that can go negative or even complex 
valued. More research is needed since very little is known about the performance 
of these estimators. 

4. Estimation by MCMC 

We show in this section how to approximate the predictive distribution and the 
bayes rule for the smoothness parameter by using Markov Chain Monte Carlo 
techniques. 

4.1. POSTERIOR MEAN OF THE BANDWIDTH 

Apply bayes' theorem to (^ to obtain the posterior distribution of h, 
,,, ^ ^ix,Xi,X2,...,Xn\h)n{h) 

Tr{h\x,Xi,X2,.-.,Xn) = poo ,. \ \ ^ \j \^V 

Jo 9{x,Xi,X2,.-.,Xn\T)7r{T)dT 

where tt is a function of h. It is worth noticing that n is not the prior on h. It is 
only the part of the prior on h that we can manipulate. Recall that $„ integrates 
to the function of h given by (O) so effectively the prior that is producing (OJ) is, 

n(M « j0^ (15) 

The posterior mean is then given by, 

- J^ h<^{x,Xi,X2,...,Xn\h)n{h)dh 
E{h\x,Xi,X2,-.-,Xn) ^ h^ = r°°^/ iu\ ru\ji. (l^j 

Jg 9(X,Xi,X2,-. ■,Xn\h)'K{h)dh 
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Equation ( [iq ) provides a different estimator for each value of a;. To obtain a single 
global estimate for h just erase the x's from (|6|) and change n to n — 1 in the 
formulas below. When K is the univariate gaussian kernel and Tr{h) = h^ equation 
dig ) simplifies to: 

V ,, ^^ a(i)s(i) 

7 /-» A^all paths V— / V— / /., — -. 

2^all paths '^V-i 

where, 

1 r(2+^^) 

^"'^ = 71 r (4^) ^'^) 

z = (io, • ■ • , in) is a path, a — s^("+*) and 



S2 



{i) = ^ix,~x,^r. (19) 



Equation (n^) follows from two applications of the formula, 

^°° /i-(^+i) exp [-^]dh = l2P/^T{P/2)s-^ (20) 



4.1.1. Bandwidth by Metropolis 

To approximate equation ( p7| ) we use the fact that the ratio of the two sums is 
the expected value of a random variable that takes the value s(i) on the path i 
which is generated with probability proportional to a{i). The following version of 
the Metropolis algorithm produces a sequence of averages that converge to the 
expected value. 

Algorithm 

0) Start from somewhere 
i<^(l,2^...,n,0) 

s2 <— Z^j^o{xj — Xi-) 

a ^ (s2)-("+*)/2 

A^ ^ 0, sum 4— 0, ave ^ 

1) Sweep along the components of i 

for k from to n do 

{ 



^L<- 


- Uniform on {0, . . . , n} \ {fc, ife} 


Afc. 


^ {Xi'^ - X^^){XV^ + Xi^ - 2xk) 


s2' i 


- s2 + Ak 


a' ^ 


- (s2')-("+*)/' 


R^ 


-a' /a 
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if i? > 1 or Unif[0,l] < R then {u ^ i'^, s2 ^ s2', a <~ a'} 
} 

2) Update the estimate for the average, 
sum ^ sum +vs2 
N ^ N + 1 
ave ^ sum/A^ 
goto 1) 



4.2. THE PREDICTIVE DISTRIBUTION BY GIBBS 

To sample from the predictive distribution, /(a;|a;i, a;2, . . . ,a;n) we use Gibbs to 
sample from the joint distribution, /(x, h\xi, X2, . • . , a;„). Hence, we only need to 
know how to sample from the two conditionals, a) f{x\h, Xl^X2^ ■ ■ ■ ,Xn) and b) 
Tr(h\x,xi,X2, ■ ■ ■ ,Xn)- To sample from a) we use the fact that this is (almost) 
the classical kernel so all we need is to generate from an equiprobablc mixture 
of gaussians. To get samples from b) just replace the gaussian kernel into the 
numerator of equation (H) to obtain, for Tr{h) ex h^ , 

TT{h\x,Xi,X2,...,Xn)^ J2 ''^^"^'^'^ ^''P J" |^ J ' ^"^^^ 

all paths ^ ^ 

The integral with respect to h of each of the terms being added in (|2^) is pro- 
portional to s~^"+''-' (see (po|)). Thus, by multiplying and dividing by this integral 
each term, we can write, 

TT{h\x,Xi,X2,.-.,Xn)cC ^ a(i)7rs(j) (/l) (22) 

all paths 

where a{i) = (.s(z))~'^^^+'^^ as before and, 

^,(/i)(x/i-("+^+i)exp|-^| (23) 

From the change of variables theorem it follows that if y is Gamma(^^^, 1) then 
/i = w |- follows the distribution ([2^). This shows that the posterior distribution of 

his a mixture of transformed gamma distributions. This mixture can be generated 
by a simple modification to the algorithm used to get the posterior mean of h. 

5. Experiments on simulated and real data 

I have coded (essentially) the algorithm described in the previous section in MAPLE 
and tested it dozens of times on simulated data for computing a global value for 
the smoothness parameter h. All the experiments were carried out with 6=1 
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Solid=0 . 5N (-1, . 5) +0 . 5N (1, 1) , n=50, 3bumps=CV, 2bumps=MCMC 




-6 -4 -2 



Figure 2. Posterior mean of global h vs plain cross-validation 



i.e. with 7r(/i) = h~^ on mixtures of gaussians. The experiments clearly indi- 
cate that the global value of h provided by the MCMC algorithm produce a 
kernel estimator that is either identical to plain likelihood cross-validation or 
clearly superior to it depending on the experiment. A typical run is presented 
in figure where the true density and the two estimators from 50 iid observa- 
tions arc shown. The MAPLE package used in the experiments is available at 
http:/ /omega, albany. edu:8008/npde. mpt . 



For comparison with other density estimators in the literature we show in figure 
H the estimate for the complete set of 109 observations of the Old Faithful geyser 
data. These data are the 107 observations in H plus the two outliers 610 and 620. 
This is a standard gaussian kernel with the global value of ft, = 14.217 chosen by 
the MCMC algorithm. 
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Figure 3. Estimate for the Old Faithful geyser data, h = 14.217 



6. Conclusions 

There is nothing special about dimension one. Only minor cosmetic changes (of 
the kind: replace h to h"^ in some formulas) are needed to include the multivariate 
case, i.e. the case when the Xj's are d-dimensional vectors instead of real variables. 
Very little is known about these estimators beyond of what it is presented in 
this paper, In particular nothing is known about rates of convergence. There are 
many avenues to explore with theory and with simulations but clearly the most 
interesting and promissing open questions arc those related to the performance of 
the QM method above. 
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